Abstract — In this paper, we present a method for optimal 
control synthesis of a plant that interacts with a set of agents 
in a graph-like environment. The control specification is given 
as a temporal logic statement about some properties that hold 
at the vertices of the environment. The plant is assumed to 
be deterministic, while the agents are probabilistic Markov 
models. The goal is to control the plant such that the probability 
of satisfying a syntactically co-safe Linear Temporal Logic 
formula is maximized. We propose a computationally efficient 
incremental approach based on the fact that temporal logic 
verification is computationally cheaper than synthesis. We 
present a case-study where we compare our approach to the 
classical non-incremental approach in terms of computation 
time and memory usage. 

I. Introduction 

Temporal logics [1], such as Linear Temporal Logic (LTL) 
and Computation Tree Logic (CTL), are traditionally used 
for verification of non-deterministic and probabilistic sys- 
tems [2]. Even though temporal logics are suitable for spec- 
ifying complex missions for control systems, they did not 
gain popularity in the control community until recently [3], 
[4], [5]. 

The existing works on control synthesis focus on specifica- 
tions given in linear time temporal logic. The systems, which 
sometimes are obtained through an additional abstraction 
process [3], [6], have finitely many states. With few excep- 
tions [7], their states are fully observable. For such systems, 
control strategies can be synthesized through exhaustive 
search of the state space. If the system is deterministic, 
model checking tools can be easily adapted to generate 
control strategies [4]. If the system is non-deterministic, the 
control problem can be mapped to the solution of a Rabin 
game [8], [6], or a simpler Büchi [9] or GR(1) game [10], 
if the specification is restricted to fragments of LTL. For 
probabilistic systems, the LTL control synthesis problem 
reduces to computing a control policy for a Markov Decision 
Process (MDP) [11], [12], [13]. 

In this work, we consider mission specifications expressed 
as syntactically co-safe LTL formulas [14]. We focus on a 
particular type of a multi-agent system formed by a determin- 
istically controlled plant and a set of independent, probabilis- 
tic, uncontrollable agents, operating on a common, graph- 
like environment. An illustrative example is a car (plant) 
approaching a pedestrian crossing, while there are some 
pedestrians (agents) waiting to cross or already crossing the 
road. As the state space of the system grows exponentially 
with the number of pedestrians, one may not be able to utilize 
any of the existing approaches under computational resource 
constraints when there is a large number of pedestrians. 

We partially address this problem by proposing an in- 
cremental control synthesis method that exploits the inde- 
pendence between the components of the system, i.e., the 
plant modeled as a deterministic transition system and the 
agents, modeled as Markov chains, and the fact that verifi- 
cation is computationally cheaper than synthesis. We aim 
to synthesize a plant control strategy that maximizes the 
probability of satisfying a mission specification given as 
a syntactically co-safe LTL formula. Our method initially 
considers a considerably smaller agent subset and synthesizes 
a control policy that maximizes the probability of satisfying 
the mission specification for the subsystem formed by the 
plant and this subset. This control policy is then verified 
against the remaining agents. At each iteration, we remove 
transitions and states that are not needed in subsequent iter- 
ations. This leads to a significant reduction in computation 
time and memory usage. It is important to note that our 
method does not need to run to completion. A sub-optimal 
control policy can be obtained by forcing termination at 
a given iteration if the computation is performed under 
stringent resource constraints. It must also be noted that our 
framework easily extends to the case when the plant is a 
Markov Decision Process, and we consider a deterministic 
plant only for simplicity of presentation. We experimentally 
evaluate the performance of our approach and show that 
our method clearly outperforms existing non-incremental 
approaches. Various methods that also use verification during 
incremental synthesis have been previously proposed in [15], 
[16]. However, the approach that we present in this paper is, 
to the best of our knowledge, the first use of verification 
guided incremental synthesis in the context of probabilistic 
systems. 

The rest of the paper is organized as follows: In Sec. II, 
we give necessary definitions and some preliminaries in 
formal methods. The control synthesis problem is formally 
stated in Sec. III and the solution is presented in Sec. IV. 
Experimental results are included in Sec. V. We conclude 
with final remarks in Sec. VI. 

II. Preliminaries 

For a set $\Sigma$, we use $|\Sigma|$ and $2^{\Sigma}$ to denote its cardinality 
and power set, respectively. A (finite) word $\omega$ over a set $\Sigma$ is 
a sequence of symbols $\omega = \omega^0 \ldots \omega^l$ such that 
$\omega^i \in \Sigma$ $\forall i = 0, \ldots, l$. 



Definition II.1 (Transition System). A transition system 
(TS) is a tuple $T := (Q_T, q_T^0, \mathcal{A}_T, \alpha_T, \delta_T, \Pi_T, \mathcal{L}_T)$, where 

• $Q_T$ is a finite set of states; 

• $q_T^0 \in Q_T$ is the initial state; 

• $\mathcal{A}_T$ is a finite set of actions; 

• $\alpha_T : Q_T \rightarrow 2^{\mathcal{A}_T}$ is a map giving the set of actions 
available at a state; 

• $\delta_T \subseteq Q_T \times \mathcal{A}_T \times Q_T$ is the transition relation; 

• $\Pi_T$ is a finite set of atomic propositions; 

• $\mathcal{L}_T : Q_T \rightarrow 2^{\Pi_T}$ is a satisfaction map giving the set of 
atomic propositions satisfied at a state. 

Definition II.2 (Markov Chain). A (discrete-time, labelled) 
Markov chain (MC) is a tuple $M := (Q_M, q_M^0, \delta_M, \Pi_M, \mathcal{L}_M)$, 
where $Q_M$, $\Pi_M$, and $\mathcal{L}_M$ are the set of states, the set of 
atomic propositions, and the satisfaction map, respectively, 
as in Def. II.1, and 

• $q_M^0 \in Q_M$ is the initial state; 

• $\delta_M : Q_M \times Q_M \rightarrow [0, 1]$ is the transition probability 
function that satisfies $\sum_{q' \in Q_M} \delta_M(q, q') = 1$ $\forall q \in Q_M$. 
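
As an illustration only, the stochasticity condition of Def. II.2 can be checked mechanically. The sketch below is written in Python and assumes a nested-dictionary encoding of $\delta_M$; the function name and the two-state pedestrian chain are hypothetical and serve only as an example.

```python
def is_valid_markov_chain(states, delta, tol=1e-9):
    """Check that every state has an outgoing distribution summing to 1.

    delta[q][q2] is the probability of moving from q to q2 (assumed
    encoding; missing entries are read as probability 0).
    """
    for q in states:
        row = delta.get(q, {})
        # Successors must be known states with probabilities in [0, 1].
        if any(q2 not in states or not (0.0 <= p <= 1.0) for q2, p in row.items()):
            return False
        if abs(sum(row.values()) - 1.0) > tol:
            return False
    return True


# Hypothetical pedestrian that waits at c0 or steps to c1 and stays there.
pedestrian_states = {"c0", "c1"}
pedestrian_delta = {"c0": {"c0": 0.4, "c1": 0.6}, "c1": {"c1": 1.0}}
print(is_valid_markov_chain(pedestrian_states, pedestrian_delta))
```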

In this paper, we are interested in temporal logic missions 
over a finite time horizon and we use syntactically 
co-safe LTL formulas [17] to specify them. Informally, a 
syntactically co-safe LTL formula over the set $\Pi$ of atomic 
propositions comprises boolean operators $\neg$ (negation), $\vee$ 
(disjunction) and $\wedge$ (conjunction), and temporal operators 
$\mathbf{X}$ (next), $\mathbf{U}$ (until) and $\mathbf{F}$ (eventually). Any syntactically 
co-safe LTL formula can be written in positive normal form, 
where the negation operator $\neg$ occurs only in front of atomic 
propositions. For instance, $\mathbf{X}\, p$ states that at the next position 
of the word, proposition $p$ is true. The formula $p_1 \,\mathbf{U}\, p_2$ states 
that there is a future position of the word when proposition 
$p_2$ is true, and proposition $p_1$ is true at least until $p_2$ is true. 
For any syntactically co-safe LTL formula $\phi$ over a set $\Pi$, 
one can construct an FSA with input alphabet $2^{\Pi}$ accepting 
all and only finite words over $2^{\Pi}$ that satisfy $\phi$. FSAs are 
defined next. 

Definition II.3 (Finite State Automaton). A (deterministic) 
finite state automaton (FSA) is a tuple $F := 
(Q_F, q_F^0, \Sigma_F, \delta_F, \mathcal{F}_F)$, where 

• $Q_F$ is a finite set of states; 

• $q_F^0 \in Q_F$ is the initial state; 

• $\Sigma_F$ is an input alphabet; 

• $\delta_F \subseteq Q_F \times \Sigma_F \times Q_F$ is a deterministic transition relation; 

• $\mathcal{F}_F \subseteq Q_F$ is a set of accepting (final) states. 

A run of $F$ over an input word $\omega = \omega^0 \omega^1 \ldots \omega^l$, where 
$\omega^i \in \Sigma_F$ $\forall i = 0, \ldots, l$, is a sequence $r_F = q^0 q^1 \ldots q^l q^{l+1}$ 
such that $(q^i, \omega^i, q^{i+1}) \in \delta_F$ $\forall i = 0, \ldots, l$ and $q^0 = q_F^0$. An 
FSA $F$ accepts a word over $\Sigma_F$ if and only if the corresponding 
run ends in some $q \in \mathcal{F}_F$. 
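
The run and acceptance condition of Def. II.3 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this paper: the dictionary encoding of $\delta_F$, the function name, and the abstract propositions `col` and `end` are assumptions made for the example.

```python
def fsa_accepts(delta, q0, accepting, word):
    """Run a deterministic FSA over a finite word and test acceptance.

    delta maps (state, symbol) to the next state (assumed encoding);
    a missing entry means the run is blocked and the word is rejected.
    """
    q = q0
    for symbol in word:
        if (q, symbol) not in delta:
            return False
        q = delta[(q, symbol)]
    return q in accepting


# Hypothetical two-state FSA for "no collision until the end is reached".
EMPTY, END = frozenset(), frozenset({"end"})
COL, BOTH = frozenset({"col"}), frozenset({"col", "end"})
delta = {("q0", EMPTY): "q0", ("q0", END): "q1", ("q0", BOTH): "q1"}
for sym in (EMPTY, END, COL, BOTH):
    delta[("q1", sym)] = "q1"  # once accepted, every symbol keeps the run in q1

print(fsa_accepts(delta, "q0", {"q1"}, [EMPTY, EMPTY, END]))
print(fsa_accepts(delta, "q0", {"q1"}, [EMPTY, COL, END]))
```

In this encoding a collision before `end` leaves the run without a successor, so the word is rejected.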

Definition II.4 (Markov Decision Process). A 
Markov decision process (MDP) is a tuple $P := 
(Q_P, q_P^0, \mathcal{A}_P, \alpha_P, \delta_P, \Pi_P, \mathcal{L}_P)$, where 

• $Q_P$ is a finite set of states; 

• $q_P^0 \in Q_P$ is the initial state; 

• $\mathcal{A}_P$ is a finite set of actions; 

• $\alpha_P : Q_P \rightarrow 2^{\mathcal{A}_P}$ is a map giving the set of actions 
available at a state; 

• $\delta_P : Q_P \times \mathcal{A}_P \times Q_P \rightarrow [0, 1]$ is the transition probability 
function that satisfies $\sum_{q' \in Q_P} \delta_P(q, a, q') = 1$ $\forall q \in Q_P$, 
$a \in \alpha_P(q)$ and $\sum_{q' \in Q_P} \delta_P(q, a, q') = 0$ $\forall q \in Q_P$, 
$a \notin \alpha_P(q)$; 

• $\Pi_P$ is a finite set of atomic propositions; 

• $\mathcal{L}_P : Q_P \rightarrow 2^{\Pi_P}$ is a map giving the set of atomic 
propositions satisfied in a state. 

For an MDP $P$, we define a stationary policy $\mu_P : Q_P \rightarrow 
\mathcal{A}_P$ such that for a state $q \in Q_P$, $\mu_P(q) \in \alpha_P(q)$. This 
stationary policy can then be used to resolve all nondeterministic 
choices in $P$ by applying action $\mu_P(q)$ at each 
$q \in Q_P$. A path of $P$ under policy $\mu_P$ is a finite sequence 
of states $r_P^{\mu_P} = q^0 q^1 \ldots q^l$ such that $l \geq 0$, $q^0 = q_P^0$ 
and $\delta_P(q^{k-1}, \mu_P(q^{k-1}), q^k) > 0$ $\forall k \in [1, l]$. A path $r_P^{\mu_P}$ 
generates a finite word $\mathcal{L}_P(r_P^{\mu_P}) = \mathcal{L}_P(q^0) \mathcal{L}_P(q^1) \ldots \mathcal{L}_P(q^l)$, 
where $\mathcal{L}_P(q^k)$ is the set of atomic propositions satisfied at 
state $q^k$. Next, we use $Paths_P^{\mu_P}$ to denote the set of all paths 
of $P$ under a policy $\mu_P$. Finally, we define $P^{\mu_P}(\phi)$ as the 
probability of satisfying $\phi$ under policy $\mu_P$. 

Remark II.5. Syntactically co-safe LTL formulas have infi- 
nite time semantics, thus they are actually interpreted over 
infinite words [17]. Measurability of languages satisfying 
LTL formulas is also defined for infinite words generated 
by infinite paths [2]. However, one can determine whether 
a given infinite word satisfies a syntactically co-safe LTL 
formula by considering only a finite prefix of it. It can be 
easily shown that our above definition of $Paths_P^{\mu_P}$ inherits 
the same measurability property given in [2]. 

III. Problem Formulation and Approach 

In this section we introduce the control synthesis problem 
with temporal constraints for a system that models a plant 
operating in the presence of probabilistic independent agents. 

A. System Model 

Consider a system consisting of a deterministic plant that 
we can control (e.g., a robot) and $n$ agents operating in an 
environment modeled by a graph $\mathcal{E} = (V, \rightarrow_{\mathcal{E}}, \mathcal{L}_{\mathcal{E}})$, 
where $V$ is the set of vertices, $\rightarrow_{\mathcal{E}} \subseteq V \times V$ is the set of 
edges, and $\mathcal{L}_{\mathcal{E}}$ is the labeling function that maps each vertex 
to a proposition in $\Pi_{\mathcal{E}}$. For example, $\mathcal{E}$ can be the quotient 
graph of a partitioned environment, where $V$ is a set of labels 
for the regions in the partition and $\rightarrow_{\mathcal{E}}$ is the corresponding 
adjacency relation (see Figs. 1 and 2). Agent $i$ is modeled as an 
MC $M_i = (Q_i, q_i^0, \delta_i, \Pi_i, \mathcal{L}_i)$, with $Q_i \subseteq V$ and $\delta_i \subseteq \rightarrow_{\mathcal{E}}$, 
$i = 1, \ldots, n$. The plant is assumed to be a deterministic 
transition system (TS) $T = (Q_T, q_T^0, \mathcal{A}_T, \alpha_T, \delta_T, \Pi_T, \mathcal{L}_T)$, 
where $Q_T \subseteq V$ and $\delta_T \subseteq \rightarrow_{\mathcal{E}}$. We assume that all components 
of the system (the plant and the agents) make transitions 
synchronously by picking edges of the graph. We also 
assume that the state of the system is perfectly known at 
any given instant and we can control the plant but we have 
no control over the agents. 

Fig. 1: A partitioned road environment, where a car (plant) is required to 
reach $c_4$ without colliding with any of the pedestrians (agents). 

We define the sets of propositions and labeling functions 
of the individual components of the system such that they 
inherit the propositions of their current vertex from the graph 
while preserving their own identities. Formally, we have 
$\Pi_T = \{(T, \mathcal{L}_{\mathcal{E}}(q)) \mid q \in Q_T\}$ and $\mathcal{L}_T(q) = (T, \mathcal{L}_{\mathcal{E}}(q))$ for 
the plant, and $\Pi_i = \{(i, \mathcal{L}_{\mathcal{E}}(q)) \mid q \in Q_i\}$ and $\mathcal{L}_i(q) = 
(i, \mathcal{L}_{\mathcal{E}}(q))$ for agent $i$. Finally, we define the set $\Pi$ of 
propositions as $\Pi = \Pi_T \cup \Pi_1 \cup \ldots \cup \Pi_n \subseteq \{(i, p) \mid i \in 
\{T, 1, \ldots, n\}, p \in \Pi_{\mathcal{E}}\}$. 

B. Problem Formulation 



As it will become clear in Sec. IV-D, the joint behavior 
of the plant and agents in the graph environment can be 
modeled by the parallel composition of the TS and MC 
models described above, which takes the form of an MDP 
(see Def. II.4). Given a syntactically co-safe LTL formula 
$\phi$ over $\Pi$, our goal is to synthesize a policy for this MDP, 
which we will simply refer to as the system, such that the 
probability of satisfying $\phi$ is either maximized or above a 
given threshold. Since we assume perfect state information, 
the plant can implement a control policy computed for the 
system, i.e., based on its state and the state of all the other 
agents. As a result, we will not distinguish between a control 
policy for the plant and a control policy for the system, and 
we will refer to it simply as control policy. We can now 
formulate the main problem considered in this paper: 

Problem III.1. Given a system described by a plant $T$ and 
a set of agents $M_1, \ldots, M_n$ operating on a graph $\mathcal{E}$, and 
given a specification in the form of a syntactically co-safe 
LTL formula $\phi$ over $\Pi$, synthesize a control policy $\mu^*$ that 
satisfies the following objective: (a) If a probability threshold 
$p_{thr}$ is given, the probability that the system satisfies $\phi$ under 
$\mu^*$ exceeds $p_{thr}$. (b) Otherwise, $\mu^*$ maximizes the probability 
that the system satisfies $\phi$. If no such policy exists, report 
failure. 



As will be shown in Sec. IV-A, the parallel composition 
of MDP and MC models also takes the form of an MDP. 
Hence, our approach can easily accommodate the case where 
the plant is a Markov Decision Process. We consider a 
deterministic plant only for simplicity of presentation. 

Fig. 2: TS $T$ and MCs $M_1, \ldots, M_5$ that model the car and the pedestrians. 

Fig. 3: Deterministic FSA $F$ that corresponds to $\phi = \neg col \,\mathbf{U}\, end$, where 
$col = \bigvee_{i=1\ldots5,\, j=0\ldots4} ((T, c_j) \wedge (i, c_j))$ and $end = (T, c_4)$. $q_0$ and $q_1$ are initial and 
final states, respectively. 

Example III.2. Fig. 1 illustrates a car in a 5-cell environment 
with 5 pedestrians, where $\mathcal{L}_{\mathcal{E}}(v) = v$ for $v \in 
\{c_0, \ldots, c_4\}$. Fig. 2 illustrates the TS $T$ and the MCs 
$M_1, \ldots, M_5$ that model the car and the pedestrians. The 
car is required to reach the end of the crossing ($c_4$) without 
colliding with any of the pedestrians. To enforce this behavior, 
we write our specification as 

$$\phi := \neg \Big( \bigvee_{i=1\ldots5,\, j=0\ldots4} \big( (T, c_j) \wedge (i, c_j) \big) \Big) \; \mathbf{U} \; (T, c_4). \qquad (1)$$

The deterministic FSA that corresponds to $\phi$ is given in 
Fig. 3, where $col = \bigvee_{i=1\ldots5,\, j=0\ldots4} ((T, c_j) \wedge (i, c_j))$ and 
$end = (T, c_4)$. 

C. Solution Outline 



One can directly solve Prob. III.1 by reducing it to a Maximal 
Reachability Probability (MRP) problem on the MDP 
modeling the overall system [18]. This approach, however, 
is very resource demanding as it scales exponentially with 
the number of agents. As a result, the environment size and 
the number of agents that can be handled in a reasonable 
time frame and with limited memory are small. To address 
this issue, we propose a highly efficient incremental control 
synthesis method that exploits the independence between 
the system components and the fact that verification is less 
demanding than synthesis. At each iteration $i$, our method 
will involve the following steps: synthesis of an optimal 
control policy considering only some of the agents (Sec. IV-D), 
verification of this control policy with respect to the 
complete system (Sec. IV-E) and minimization of the system 
model under the guidance of this policy (Sec. IV-F). 



IV. Problem Solution 



Our solution to Prob. III.1 is given in the form of Alg. 1. 
In the rest of this section, we explain each of its steps in 
detail. 

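Algorithm 1: Incremental-Control-Synthesis 

Input: $T, M_1, \ldots, M_n, \phi, (p_{thr})$. 

Output: $\mu^*$ s.t. $P^{\mu^*}(\phi) \geq P^{\mu}(\phi)$ $\forall \mu$ if $p_{thr}$ is not 
given, otherwise $P^{\mu^*}(\phi) \geq p_{thr}$. 

 1  $\mathcal{M} \leftarrow \{M_1, \ldots, M_n\}$. 
 2  Construct FSA $F$ corresponding to $\phi$. 
 3  $\mu^* \leftarrow \emptyset$, $P^{\mu^*}_{\mathcal{M}}(\phi) \leftarrow 0$, $\mathcal{M}_0 \leftarrow \emptyset$, $A_0 \leftarrow T$, $i \leftarrow 1$. 
 4  Process $\phi$ to form $\mathcal{M}_1^{new}$. 
 5  while True do 
 6      $\mathcal{M}_i \leftarrow \mathcal{M}_{i-1} \cup \mathcal{M}_i^{new}$. 
 7      $A_i \leftarrow A_{i-1} \otimes \mathcal{M}_i^{new}$. 
 8      $P_i \leftarrow A_i \otimes F$. 
 9      Synthesize $\mu_i$ that maximizes $P^{\mu_i}_{\mathcal{M}_i}(\phi)$ using $P_i$. 
10      if $p_{thr}$ given then 
11          if $P^{\mu_i}_{\mathcal{M}_i}(\phi) < p_{thr}$ then 
12              Fail: $\nexists \mu$ such that $P^{\mu}(\phi) \geq p_{thr}$. 
13          else if $\mathcal{M}_i = \mathcal{M}$ then 
14              Success: $\mu^* \leftarrow \mu_i$, Return $\mu^*$. 
15          else 
16              Continue with verification on line 20. 
17      else if $\mathcal{M}_i = \mathcal{M}$ then 
18          Success: $\mu^* \leftarrow \mu_i$, Return $\mu^*$. 
19      else 
20          Obtain the MC $M^{\mu_i}_{\mathcal{M}_i}$ induced on $P_i$ by $\mu_i$. 
21          $\bar{\mathcal{M}}_i \leftarrow \mathcal{M} \setminus \mathcal{M}_i$. 
22          $M^{\mu_i}_{\mathcal{M}} \leftarrow M^{\mu_i}_{\mathcal{M}_i} \otimes \bar{\mathcal{M}}_i$. 
23          $V_i \leftarrow M^{\mu_i}_{\mathcal{M}} \otimes F$. 
24          Compute $P^{\mu_i}_{\mathcal{M}}(\phi)$ using $V_i$. 
25          if $P^{\mu_i}_{\mathcal{M}}(\phi) > P^{\mu^*}_{\mathcal{M}}(\phi)$ then 
26              $\mu^* \leftarrow \mu_i$, $P^{\mu^*}_{\mathcal{M}}(\phi) \leftarrow P^{\mu_i}_{\mathcal{M}}(\phi)$. 
27          if $p_{thr}$ given and $P^{\mu^*}_{\mathcal{M}}(\phi) \geq p_{thr}$ then 
28              Success: Return $\mu^*$. 
29          else 
30              Set $\mathcal{M}_{i+1}^{new}$ to some agent $M_j \in \bar{\mathcal{M}}_i$. 
31              Minimize $A_i$. 
32              Increment $i$. 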



A. Parallel Composition of System Components 

Given the set $\mathcal{M} = \{M_1, \ldots, M_n\}$ of all agents, we use 
$\mathcal{M}_i \subseteq \mathcal{M}$ to denote its subset used at iteration $i$. Then, we 
define the synchronous parallel composition $T \otimes \mathcal{M}_i$ of $T$ 
and agents in $\mathcal{M}_i = \{M_{i1}, \ldots, M_{ij}\}$ for different types of 
$T$ as follows. 

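If $T$ is a TS, then we define $T \otimes \mathcal{M}_i$ as the MDP $A = 
(Q_A, q_A^0, \mathcal{A}_A, \alpha_A, \delta_A, \Pi_A, \mathcal{L}_A) = T \otimes \mathcal{M}_i$, such that 

• $Q_A \subseteq Q_T \times Q_{i1} \times \ldots \times Q_{ij}$ such that a state $q = 
(q_T, q_{i1}, \ldots, q_{ij})$ exists iff it is reachable from the initial 
states; 

• $q_A^0 = (q_T^0, q_{i1}^0, \ldots, q_{ij}^0)$; 

• $\alpha_A(q) = \alpha_T(q_T)$, where $q_T$ is the element of $q$ that 
corresponds to the state of $T$; 

• $\Pi_A = \Pi_T \cup \Pi_{i1} \cup \ldots \cup \Pi_{ij}$; 

• $\mathcal{L}_A(q) = \mathcal{L}_T(q_T) \cup \mathcal{L}_{i1}(q_{i1}) \cup \ldots \cup \mathcal{L}_{ij}(q_{ij})$; 

• $\delta_A(q = (q_T, q_{i1}, \ldots, q_{ij}), a, q' = (q'_T, q'_{i1}, \ldots, q'_{ij})) = 
1\{(q_T, a, q'_T) \in \delta_T\} \times \delta_{i1}(q_{i1}, q'_{i1}) \times \ldots \times \delta_{ij}(q_{ij}, q'_{ij})$, 

where $1\{\cdot\}$ is the indicator function. 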
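
To make the construction concrete, the following Python sketch builds the reachable part of $T \otimes \mathcal{M}_i$ for a deterministic TS. It is an illustrative sketch only: the nested-dictionary encodings of $\delta_T$ and of the agents' transition functions, and the function name `compose`, are assumptions of this example.

```python
from itertools import product


def compose(ts_delta, ts_init, chains):
    """Synchronous product of a deterministic TS with independent MCs.

    ts_delta: {plant_state: {action: next_plant_state}}   (assumed encoding)
    chains:   list of (initial_state, {state: {next_state: probability}})
    Returns the initial joint state and the MDP transition map
    {joint_state: {action: {next_joint_state: probability}}}, restricted
    to joint states reachable from the initial one.
    """
    init = (ts_init,) + tuple(q0 for q0, _ in chains)
    mdp, frontier = {}, [init]
    while frontier:
        q = frontier.pop()
        if q in mdp:
            continue
        mdp[q] = {}
        q_t, agent_states = q[0], q[1:]
        # Each agent moves independently according to its own chain.
        moves = [list(d[s].items()) for (_, d), s in zip(chains, agent_states)]
        for action, q_t_next in ts_delta.get(q_t, {}).items():
            dist = {}
            for combo in product(*moves):
                prob = 1.0
                for _, p in combo:
                    prob *= p  # product of the agents' probabilities
                if prob > 0.0:
                    nxt = (q_t_next,) + tuple(s for s, _ in combo)
                    dist[nxt] = dist.get(nxt, 0.0) + prob
                    frontier.append(nxt)
            mdp[q][action] = dist
    return init, mdp


# Hypothetical car with two cells and one pedestrian.
car = {"c0": {"go": "c1", "wait": "c0"}, "c1": {"wait": "c1"}}
ped = ("c1", {"c1": {"c1": 0.5, "c2": 0.5}, "c2": {"c2": 1.0}})
init, mdp = compose(car, "c0", [ped])
print(init, mdp[init]["go"])
```

Exploring from the initial state only yields exactly the reachable joint states, which is the restriction imposed on $Q_A$ above.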

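If $T$ is an MDP, then we define $T \otimes \mathcal{M}_i$ as the MDP 
$A = (Q_A, q_A^0, \mathcal{A}_A, \alpha_A, \delta_A, \Pi_A, \mathcal{L}_A) = T \otimes \mathcal{M}_i$, such that 
$Q_A$, $q_A^0$, $\alpha_A$, $\Pi_A$, and $\mathcal{L}_A$ are as given in the case where 
$T$ is a TS and 

• $\delta_A(q = (q_T, q_{i1}, \ldots, q_{ij}), a, q' = (q'_T, q'_{i1}, \ldots, q'_{ij})) = 
\delta_T(q_T, a, q'_T) \times \delta_{i1}(q_{i1}, q'_{i1}) \times \ldots \times \delta_{ij}(q_{ij}, q'_{ij})$. 

Finally, if $T$ is an MC, then we define $T \otimes \mathcal{M}_i$ as the MC 
$A = (Q_A, q_A^0, \delta_A, \Pi_A, \mathcal{L}_A) = T \otimes \mathcal{M}_i$, where $Q_A$, $q_A^0$, $\Pi_A$, 
$\mathcal{L}_A$ are as given in the case where $T$ is a TS and 

• $\delta_A(q = (q_T, q_{i1}, \ldots, q_{ij}), q' = (q'_T, q'_{i1}, \ldots, q'_{ij})) = 
\delta_T(q_T, q'_T) \times \delta_{i1}(q_{i1}, q'_{i1}) \times \ldots \times \delta_{ij}(q_{ij}, q'_{ij})$. 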

B. Product MDP and Product MC 

Given the deterministic FSA $F$ that recognizes all and 
only the finite words that satisfy $\phi$, we define the product 
$M \otimes F$ for different types of $M$ as follows. 

If $M$ is an MDP, we define $M \otimes F$ as the product MDP 
$P = (Q_P, q_P^0, \mathcal{A}_P, \alpha_P, \delta_P, \Pi_P, \mathcal{L}_P) = M \otimes F$, where 

• $Q_P \subseteq Q_M \times Q_F$ such that a state $q$ exists iff it is 
reachable from the initial states; 

• $q_P^0 = (q_M^0, q_F)$ such that $(q_F^0, \mathcal{L}_M(q_M^0), q_F) \in \delta_F$; 

• $\mathcal{A}_P = \mathcal{A}_M$; 

• $\alpha_P((q_M, q_F)) = \alpha_M(q_M)$; 

• $\Pi_P = \Pi_M$; 

• $\mathcal{L}_P((q_M, q_F)) = \mathcal{L}_M(q_M)$; 

• $\delta_P((q_M, q_F), a, (q'_M, q'_F)) = 1\{(q_F, \mathcal{L}_M(q'_M), q'_F) \in \delta_F\} 
\times \delta_M(q_M, a, q'_M)$, 

where $1\{\cdot\}$ is the indicator function. In this product MDP, 
we also define the set $\mathcal{F}_P$ of final states such that a state 
$q = (q_M, q_F) \in \mathcal{F}_P$ iff $q_F \in \mathcal{F}_F$, where $\mathcal{F}_F$ is the set of final 
states of $F$. 
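
The product construction can be sketched along the same lines. The Python fragment below is only an illustration under assumed encodings (nested dictionaries for the MDP, a dictionary keyed by state and label for $\delta_F$, and a labeling function returning a frozenset); it also assumes that the FSA has a transition from its initial state on the label of the initial MDP state.

```python
def product_with_fsa(mdp, init, label, fsa_delta, fsa_init, fsa_final):
    """Product of an MDP with a deterministic FSA, reachable part only.

    mdp:       {state: {action: {next_state: probability}}}  (assumed encoding)
    label:     function mapping an MDP state to a frozenset of propositions
    fsa_delta: {(fsa_state, frozenset_of_propositions): next_fsa_state}
    """
    # The FSA first reads the label of the initial MDP state.
    p_init = (init, fsa_delta[(fsa_init, label(init))])
    prod, final, frontier = {}, set(), [p_init]
    while frontier:
        state = frontier.pop()
        if state in prod:
            continue
        q_m, q_f = state
        prod[state] = {}
        if q_f in fsa_final:
            final.add(state)
        for action, dist in mdp[q_m].items():
            out = {}
            for q_m2, p in dist.items():
                q_f2 = fsa_delta.get((q_f, label(q_m2)))
                if q_f2 is not None:  # indicator: the FSA transition must exist
                    out[(q_m2, q_f2)] = p
                    frontier.append((q_m2, q_f2))
            prod[state][action] = out
    return p_init, prod, final


# Hypothetical two-state MDP; state "b" carries the proposition "end".
toy_mdp = {"a": {"go": {"a": 0.5, "b": 0.5}}, "b": {"stay": {"b": 1.0}}}
toy_label = lambda q: frozenset({"end"}) if q == "b" else frozenset()
toy_fsa = {(0, frozenset()): 0, (0, frozenset({"end"})): 1,
           (1, frozenset()): 1, (1, frozenset({"end"})): 1}
p_init, prod, final = product_with_fsa(toy_mdp, "a", toy_label, toy_fsa, 0, {1})
print(p_init, sorted(final))
```

Probability mass that leads to a blocked FSA transition is dropped, which mirrors the indicator function in the definition of $\delta_P$.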

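If $M$ is an MC, we define $M \otimes F$ as the product MC 
$P = (Q_P, q_P^0, \delta_P, \Pi_P, \mathcal{L}_P) = M \otimes F$, where $Q_P$, $q_P^0$, $\Pi_P$, 
$\mathcal{L}_P$ are as given in the case where $M$ is an MDP and 

• $\delta_P((q_M, q_F), (q'_M, q'_F)) = 1\{(q_F, \mathcal{L}_M(q'_M), q'_F) \in \delta_F\} \times 
\delta_M(q_M, q'_M)$. 

In this product MC, we also define the set $\mathcal{F}_P$ of final states 
as given above. 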

C. Initialization 

Lines 1 to 4 of Alg. 1 correspond to the initialization 
procedure of our algorithm. First, we form the set $\mathcal{M} = 
\{M_1, \ldots, M_n\}$ of all agents and construct the FSA $F$ that 
corresponds to $\phi$. Such $F$ can be automatically constructed 
using existing tools, e.g., [19]. Since we have not synthesized 
any control policies so far, we reset the variable $\mu^*$ that holds 
the best policy at any given iteration and set the probability 
$P^{\mu^*}_{\mathcal{M}}(\phi)$ of satisfying $\phi$ under policy $\mu^*$ in the presence of 
agents in $\mathcal{M}$ to 0. As we have not considered any agents so 
far, we set the subset $\mathcal{M}_0$ to be an empty set. We then set 
$A_0$, which stands for the parallel composition of the plant $T$ 
and the agents in $\mathcal{M}_0$, to $T$. We also initialize the iteration 
counter $i$ to 1. 

Line 4 of Alg. 1 initializes the set $\mathcal{M}_1^{new}$ of agents that 
will be considered in the synthesis step of the first iteration of 
our algorithm. In order to be able to guarantee completeness, 
we require this set to be the maximal set of agents that satisfy 
the mission, i.e., the agent subset that can satisfy $\phi$ but is not 
strictly needed to satisfy $\phi$. To form $\mathcal{M}_1^{new}$, we first rewrite 
$\phi$ in positive normal form to obtain $\phi_{pnf}$, where the negation 
operator $\neg$ occurs only in front of atomic propositions. 
Conversion of $\phi$ to $\phi_{pnf}$ can be performed automatically using 
De Morgan's laws and equivalences for temporal operators 
as given in [2]. Then, using this fact, we include an agent 
$M_i \in \mathcal{M}$ in $\mathcal{M}_1^{new}$ if any of its corresponding propositions 
of the form $(i, p) : p \in \Pi_i$ appears non-negated in $\phi_{pnf}$. For 
instance, given $\phi := \neg((3, p_3) \wedge (T, p_3)) \,\mathbf{U}\, ((1, p_1) \vee (2, p_2))$, 
either one of agents $M_1$ and $M_2$ can satisfy the formula, 
whereas agent $M_3$ can only violate it. Therefore, for this 
example we set $\mathcal{M}_1^{new} = \{M_1, M_2\}$. In case $\mathcal{M}_1^{new} = \emptyset$ 
after this procedure, we form $\mathcal{M}_1^{new}$ arbitrarily by including 
some agents from $\mathcal{M}$ and proceed with the synthesis step of 
our approach. 
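
The selection rule for the initial agent subset can be illustrated with a short Python sketch. The nested-tuple encoding of formulas and the function name are hypothetical; the formula is assumed to be already in positive normal form, and the plant is identified by the owner tag `"T"`.

```python
def positive_agents(formula):
    """Collect agents whose propositions occur non-negated in the formula.

    Assumed encoding (positive normal form): ("ap", (owner, name)),
    ("not", f), ("and", f, g), ("or", f, g), ("X", f), ("U", f, g), ("F", f).
    """
    op = formula[0]
    if op == "ap":
        owner, _ = formula[1]
        return set() if owner == "T" else {owner}
    if op == "not":
        # In positive normal form, "not" wraps atomic propositions only,
        # so nothing below it counts as a non-negated occurrence.
        return set()
    result = set()
    for sub in formula[1:]:
        result |= positive_agents(sub)
    return result


# Positive normal form of the example formula from the text:
# (not (3,p3) or not (T,p3)) U ((1,p1) or (2,p2)).
phi_pnf = ("U",
           ("or", ("not", ("ap", (3, "p3"))), ("not", ("ap", ("T", "p3")))),
           ("or", ("ap", (1, "p1")), ("ap", (2, "p2"))))
print(sorted(positive_agents(phi_pnf)))
```

For the example formula the sketch selects agents 1 and 2 and leaves out agent 3, in agreement with the discussion above.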

D. Synthesis 

Lines 6 to 19 of Alg. 1 correspond to the synthesis step 
of our algorithm. At the $i$th iteration, the agent subset that 
we consider is given by $\mathcal{M}_i = \mathcal{M}_{i-1} \cup \mathcal{M}_i^{new}$, where 
$\mathcal{M}_i^{new}$ contains the agents that will be newly considered as 
provided by the previous iteration's verification stage or by 
the initialization procedure given in Sec. IV-C if $i$ is 1. First, 
we construct the parallel composition $A_i = A_{i-1} \otimes \mathcal{M}_i^{new}$ 
of our plant and the agents in $\mathcal{M}_i$ as described in Sec. IV-A. 
Notice that we use $A_{i-1}$ to save computation time and 
memory, as $A_{i-1} \otimes \mathcal{M}_i^{new}$ is typically smaller than $T \otimes \mathcal{M}_i$ 
due to the minimization procedure explained in Sec. IV-F. 
Next, we construct the product MDP $P_i = A_i \otimes F$ as 
explained in Sec. IV-B. Then, our control synthesis problem 
can be solved by solving a maximal reachability probability 
(MRP) problem on $P_i$, where one computes the maximum 
probability of reaching the set $\mathcal{F}_{P_i}$ from the initial state 
$q_{P_i}^0$ [18], after which the corresponding optimal control policy 
$\mu_i$ can be recovered as given in [2], [13]. Consequently, at 
line 9 of Alg. 1 we solve the MRP problem on $P_i$ using 
value iteration to obtain an optimal policy $\mu_i$ that maximizes the 
probability of satisfaction of $\phi$ in the presence of the agents 
in $\mathcal{M}_i$. We denote this probability by $P^{\mu_i}_{\mathcal{M}_i}(\phi)$, whereas 
$P^{\mu}(\phi)$ stands for the probability that the complete system 
satisfies $\phi$ under policy $\mu$. 
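
For illustration, value iteration for the MRP problem can be sketched in Python as below. This is a minimal sketch under an assumed nested-dictionary encoding of the product MDP, not the C++ implementation evaluated in Sec. V. The policy is read off greedily from the converged values with ties broken arbitrarily; a careful implementation would treat ties and zero-probability states as described in [2].

```python
def max_reachability(prod, final, tol=1e-9, max_iter=10_000):
    """Value iteration for the maximal reachability probability.

    prod:  {state: {action: {next_state: probability}}}  (assumed encoding)
    final: set of target states
    Returns (value, policy), where value[q] approximates the maximum
    probability of reaching `final` from q and policy[q] is a greedy
    action with respect to value.
    """
    def expected(dist, value):
        return sum(p * value[q2] for q2, p in dist.items())

    value = {q: (1.0 if q in final else 0.0) for q in prod}
    for _ in range(max_iter):
        change = 0.0
        for q, actions in prod.items():
            if q in final or not actions:
                continue
            best = max(expected(dist, value) for dist in actions.values())
            change = max(change, abs(best - value[q]))
            value[q] = best
        if change < tol:
            break
    policy = {}
    for q, actions in prod.items():
        if q not in final and actions:
            policy[q] = max(actions, key=lambda a: expected(actions[a], value))
    return value, policy


# Hypothetical product MDP with one decision at s0.
toy = {
    "s0": {"safe": {"goal": 0.8, "sink": 0.2}, "risky": {"goal": 0.5, "sink": 0.5}},
    "goal": {},
    "sink": {"stay": {"sink": 1.0}},
}
value, policy = max_reachability(toy, {"goal"})
print(round(value["s0"], 3), policy["s0"])
```

The per-state, per-action expectations computed in the inner loop are the quantities that the minimization stage of Sec. IV-F later compares against its threshold.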

The steps that we take at the end of the synthesis, i.e., lines 
10 to 19 of Alg. 1, depend on whether $p_{thr}$ is given or not. 
At any iteration $i$, if $p_{thr}$ is given and $P^{\mu_i}_{\mathcal{M}_i}(\phi) < p_{thr}$, we 
terminate by reporting that there exists no control policy $\mu$ such that 
$P^{\mu}(\phi) \geq p_{thr}$, which is a direct consequence of Prop. IV.1. If 
$p_{thr}$ is given and $P^{\mu_i}_{\mathcal{M}_i}(\phi) \geq p_{thr}$, we consider the following 
cases. If $\mathcal{M}_i = \mathcal{M}$, we set $\mu^*$ to $\mu_i$ and return $\mu^*$ as it 
satisfies the probability threshold. Otherwise, we proceed 
with the verification of $\mu_i$ as there are remaining agents 
that were not considered during synthesis and can potentially 
violate $\phi$. For the case where $p_{thr}$ is not given we consider 
the current agent subset $\mathcal{M}_i$. If $\mathcal{M}_i = \mathcal{M}$ we terminate and 
return $\mu^*$ as there are no agents left to consider. Otherwise, 
we proceed with the verification stage. 

Proposition IV.1. The sequence $\{P^{\mu_i}_{\mathcal{M}_i}(\phi)\}$ is non-increasing. 

Proof. As given in Sec. IV-C, $\mathcal{M}_1$ includes all those agents 
that can satisfy the propositions that lead to satisfaction of 
$\phi$. Let $pref(\phi)$ be the set of finite words that satisfy $\phi$ and 
let MC $M_j$ of agent $j$ be such that $M_j \notin \mathcal{M}_1$. Consider 
a finite satisfying word $\sigma$ such that $\sigma = \sigma^0 \sigma^1 \ldots \sigma^l \in 
pref(\phi)$. Suppose there exists an index $k \in \{0, \ldots, l\}$ 
such that for some $q \in Q_j$, $\mathcal{L}_j(q) \in \sigma^k$. Then, $\sigma' = 
\sigma^0 \ldots \sigma^{k-1} \tilde{\sigma}^k \sigma^{k+1} \ldots \sigma^l$ is also in $pref(\phi)$, where $\tilde{\sigma}^k = 
\sigma^k \setminus \{\mathcal{L}_j(q)\}$. Now, let $r = q^0 q^1 \ldots q^l$ be a path of the system 
after including $M_j$. Let $\omega = \mathcal{L}(r) = \omega^0 \omega^1 \ldots \omega^l$ be the word 
generated by $r$. If $\omega$ satisfies $\phi$, then $\tilde{\omega} = \tilde{\omega}^0 \tilde{\omega}^1 \ldots \tilde{\omega}^l$ also 
satisfies $\phi$, where $\tilde{\omega}^k = \omega^k \setminus \{\mathcal{L}_j(q_j^k)\}$ for each $k \in \{0, \ldots, l\}$ 
and $q_j^k$ is the state of $M_j$ in $q^k$. Thus, we conclude that 
the set of paths that satisfy $\phi$ cannot increase after we add 
agent $M_j \in \mathcal{M} \setminus \mathcal{M}_1$, and the sequence $\{P^{\mu_i}_{\mathcal{M}_i}(\phi)\}$ is non-increasing 
such that it attains its maximum value $P^{\mu_1}_{\mathcal{M}_1}(\phi)$ at 
the first iteration and does not increase as more agents from 
$\mathcal{M} \setminus \mathcal{M}_1$ are considered in the following iterations. ■ 

Corollary IV.2. If at any iteration $P^{\mu_i}_{\mathcal{M}_i}(\phi) < p_{thr}$, then 
there does not exist a policy $\mu$ such that $P^{\mu}(\phi) \geq p_{thr}$, where $\mu_i$ is 
an optimal control policy that we compute at the synthesis 
stage of the $i$th iteration considering only the agents in $\mathcal{M}_i$. 

E. Verification and Selection of $\mathcal{M}_{i+1}^{new}$ 

Lines 20 to 30 of Alg. 1 correspond to the verification 
stage of our algorithm. In the verification stage, we verify 
the policy $\mu_i$ that we have just synthesized considering the 
entire system and accordingly update the best policy so far, 
which we denote by $\mu^*$. 

Note that $\mu_i$ maximizes the probability of satisfying $\phi$ 
in the presence of agents in $\mathcal{M}_i$ and induces an MC by 
resolving all non-deterministic choices in $P_i$. Thus, we first 
obtain the induced Markov chain $M^{\mu_i}_{\mathcal{M}_i}$ that captures the 
joint behavior of the plant and the agents in $\mathcal{M}_i$ under policy 
$\mu_i$. Then, we proceed by considering the agents that were 
not considered during synthesis of $\mu_i$, i.e., agents in $\bar{\mathcal{M}}_i = 
\mathcal{M} \setminus \mathcal{M}_i$. In order to account for the existence of the agents 
that we newly consider, we exploit the independence between 
the systems and construct the MC $M^{\mu_i}_{\mathcal{M}} = M^{\mu_i}_{\mathcal{M}_i} \otimes \bar{\mathcal{M}}_i$ in 
line 22. In lines 23 and 24 of Alg. 1 we construct the product 
MC $V_i = M^{\mu_i}_{\mathcal{M}} \otimes F$ and compute the probability $P^{\mu_i}_{\mathcal{M}}(\phi)$ of 
satisfying $\phi$ in the presence of all agents in $\mathcal{M}$ by computing 
the probability of reaching $V_i$'s final states from its initial 
state using value iteration. Finally, in lines 25 and 26 we 
update $\mu^*$ so that $\mu^* = \mu_i$ if $P^{\mu_i}_{\mathcal{M}}(\phi) > P^{\mu^*}_{\mathcal{M}}(\phi)$, i.e., if we 
have a policy that is better than the best we have found so far. 
Notice that keeping track of the best policy $\mu^*$ makes Alg. 1 
an anytime algorithm, i.e., the algorithm can be terminated 
as soon as some $\mu^*$ is obtained. 
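
Two building blocks of the verification stage, namely inducing an MC from a policy and computing a reachability probability on an MC, can be sketched in Python as follows. The sketch assumes nested-dictionary encodings and hypothetical function names, and it omits the composition with the remaining agents and the product with $F$ that are performed in lines 22 and 23 of Alg. 1.

```python
def induced_chain(prod, policy):
    """Resolve every choice in an MDP with the given stationary policy."""
    return {q: (prod[q][policy[q]] if q in policy else {}) for q in prod}


def reach_probability(chain, final, init, tol=1e-9, max_iter=10_000):
    """Iteratively approximate the probability of reaching `final` from `init`.

    chain: {state: {next_state: probability}}  (assumed encoding)
    """
    value = {q: (1.0 if q in final else 0.0) for q in chain}
    for _ in range(max_iter):
        change = 0.0
        for q, dist in chain.items():
            if q in final:
                continue
            new = sum(p * value[q2] for q2, p in dist.items())
            change = max(change, abs(new - value[q]))
            value[q] = new
        if change < tol:
            break
    return value[init]


# Hypothetical MDP in which the policy picks action "a" at s0.
toy = {
    "s0": {"a": {"s0": 0.5, "goal": 0.25, "sink": 0.25}, "b": {"sink": 1.0}},
    "goal": {},
    "sink": {"stay": {"sink": 1.0}},
}
chain = induced_chain(toy, {"s0": "a", "sink": "stay"})
print(round(reach_probability(chain, {"goal"}, "s0"), 6))
```

No maximization over actions is needed here, which is one reason why verifying a fixed policy is cheaper than synthesizing one.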

At the end of the verification stage, if $p_{thr}$ is given and 
$P^{\mu^*}_{\mathcal{M}}(\phi) \geq p_{thr}$, we terminate and return $\mu^*$, as it satisfies the 
given probability threshold. Otherwise, in line 30 of Alg. 1 
we pick an arbitrary $M_j \in \bar{\mathcal{M}}_i$ to be included in $\mathcal{M}_{i+1}$, 
which we call the random agent first (RAF) rule. Note that 
one can also choose to pick the smallest $M_j$ in terms of state 
and transition count to minimize the overall computation 
time, which we call the smallest agent first (SAF) rule. 
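
As a small illustration of the SAF rule, the Python sketch below picks the remaining agent with the fewest states and breaks ties by transition count. The lexicographic ordering, the dictionary encoding and the function name are assumptions of this sketch; the text above only states that state and transition counts are used.

```python
def pick_smallest_agent(remaining):
    """Return the name of the smallest remaining agent (SAF rule sketch).

    remaining: {agent_name: {state: {next_state: probability}}}  (assumed)
    Agents are compared by (state count, transition count).
    """
    def size(delta):
        return (len(delta), sum(len(row) for row in delta.values()))

    return min(remaining, key=lambda name: size(remaining[name]))


# Hypothetical remaining agents.
remaining = {
    "M2": {"a": {"b": 1.0}, "b": {"b": 1.0}},
    "M3": {"a": {"a": 1.0}},
}
print(pick_smallest_agent(remaining))
```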

Proposition IV.3. The sequence $\{P^{\mu^*}_{\mathcal{M}}(\phi)\}$ is a non-decreasing 
sequence. 

Proof. The result directly follows from the fact that $\mu^*$ is 
set to $\mu_i$ if and only if $P^{\mu_i}_{\mathcal{M}}(\phi) > P^{\mu^*}_{\mathcal{M}}(\phi)$. ■ 

F. Minimization 

The minimization stage of our approach (line 31 in Alg. 1) 
aims to reduce the overall resource usage by removing those 
transitions and states of $A_i$ that are not needed in the 
subsequent iterations. We first set the minimization threshold 
$p_{min}$ to $p_{thr}$ if given, otherwise we set it to $P^{\mu^*}_{\mathcal{M}}(\phi)$. Next, 
we iterate over the states of $P_i$ and check the maximum 
probability of satisfying the mission under each available 
action. Note that the value iteration that we perform in 
the synthesis step already provides us with the maximum 
probability of satisfying $\phi$ from any state in $P_i$. Then, we 
remove an action $a$ from state $q_A$ in $A_i$ if for all $q_F \in Q_F$, 
the maximum probability of satisfying the mission by taking 
action $a$ at $(q_A, q_F)$ in $P_i$ is below $p_{min}$. After removing the 
transitions corresponding to all such actions, we also prune 
any orphan states in $A_i$, i.e., states that are not reachable 
from the initial state. Then, we proceed with the synthesis 
stage of the next iteration. 
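
The pruning rule can be sketched in Python as follows. The sketch is illustrative only: it assumes that the synthesis step has stored, for each product state and action, the maximum probability of satisfying the mission in a dictionary `q_values`, and it treats product states that are missing from this dictionary as unreachable. All names are hypothetical.

```python
def minimize(comp, init, q_values, fsa_states, p_min):
    """Prune low-probability actions and orphan states of the composition.

    comp:     {state: {action: {next_state: probability}}}  (assumed encoding)
    q_values: {((state, fsa_state), action): max satisfaction probability}
    """
    pruned = {}
    for q, actions in comp.items():
        pruned[q] = {}
        for a, dist in actions.items():
            probs = [q_values[((q, f), a)] for f in fsa_states
                     if ((q, f), a) in q_values]
            # Keep the action if some product state can still reach p_min with it.
            if any(p >= p_min for p in probs):
                pruned[q][a] = dist
    # Drop states that are no longer reachable from the initial state.
    reachable, frontier = set(), [init]
    while frontier:
        q = frontier.pop()
        if q in reachable:
            continue
        reachable.add(q)
        for dist in pruned[q].values():
            frontier.extend(dist)
    return {q: acts for q, acts in pruned.items() if q in reachable}


# Hypothetical composition with one good and one bad action at s0.
comp = {
    "s0": {"good": {"s1": 1.0}, "bad": {"s2": 1.0}},
    "s1": {"stay": {"s1": 1.0}},
    "s2": {"stay": {"s2": 1.0}},
}
q_values = {(("s0", 0), "good"): 0.9, (("s0", 0), "bad"): 0.1,
            (("s1", 0), "stay"): 0.9, (("s2", 0), "stay"): 0.1}
print(minimize(comp, "s0", q_values, [0], p_min=0.5))
```

In the example, removing the action `bad` leaves `s2` unreachable, so it is pruned as an orphan state.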

Proposition IV.4. The minimization phase does not affect the 
correctness and the completeness of our approach. 

Proof. To prove the correctness, we need to show that for 
an arbitrary policy $\mu$ on the minimized MDP $A_{min}$, the 
probability that $A_{min}$ satisfies $\phi$ under $\mu$ is equal to the 
probability that $A$ satisfies $\phi$ under $\mu$, where $A$ is the original 
MDP before minimization. Correctness, in this case, follows 
directly from the fact that, in each state $q$, we do not 
modify the transition probabilities associated with an action 
that is enabled in $q$ after minimization. Thus, it remains 
to show that minimization does not affect the completeness 
of the approach. We first consider the removal of orphaned 
states. Since these states cannot be reached from the initial 
state, they also will not be a part of any feasible control 
policy, and their removal does not affect the completeness 
of the approach. Finally, we consider the removal of those 
actions that drive the system to the set of target states with 
probability smaller than the minimization threshold. For the 
case where we use $p_{thr}$, completeness is not affected as we 
remove only those transitions that we would not take as we 
are looking for control policies with $P^{\mu}(\phi) \geq p_{thr}$ and 
$\{P^{\mu_i}_{\mathcal{M}_i}(\phi)\}$ is a non-increasing sequence (Prop. IV.1). For the case 
where we use $P^{\mu^*}_{\mathcal{M}}(\phi)$, we also remove those transitions that 
we would not take as $\{P^{\mu^*}_{\mathcal{M}}(\phi)\}$ is a non-decreasing sequence 
(Prop. IV.3). Hence, the minimization procedure does not 
affect the completeness of the overall approach as well. ■ 



We finally show that Alg. 1 correctly solves Prob. III.1. 

Proposition IV.5. Alg. 1 solves Prob. III.1. 

Proof. Alg. 1 combines all the steps given in this section 
and synthesizes a control policy $\mu^*$ that either ensures 
$P^{\mu^*}(\phi) \geq p_{thr}$ if $p_{thr}$ is given, or maximizes $P^{\mu^*}(\phi)$. If 
Alg. 1 terminates in line 12, completeness is guaranteed by 
the fact that $\{P^{\mu_i}_{\mathcal{M}_i}(\phi)\}$ is a non-increasing sequence as given 
in Prop. IV.1. Also, as given in Prop. IV.4, the minimization 
stage does not affect the correctness and completeness of 
the approach. Thus, Alg. 1 solves Prob. III.1. ■ 



V. Experimental Results 

In this section we return to the pedestrian crossing problem 
given in Example III.2 and illustrated in Figs. 1 and 2. The mission 
specification $\phi$ for this example is given in Eq. (1). In the 
following, we compare the performance of our incremental 
algorithm with the performance of the classical method that 
attempts to solve this problem in a single pass using value 
iteration as in [18]. 

In our experiments we used an iMac i5 quad-core desktop 
computer and considered C++ implementations of both approaches. 
During the experiments, our algorithm picked the 
new agent $\mathcal{M}_i^{new}$ to be considered at the next iteration in 
the following order: $M_1$, $M_2$, $M_3$, $M_4$, $M_5$, i.e., according 
to the smallest agent first rule given in Sec. IV-E. 

When no $p_{thr}$ was given, optimal control policies synthesized 
by both of the algorithms satisfied $\phi$ with a probability 
of 0.8. The classical approach solved the control synthesis 
problem in 6.75 seconds, and the product MDP on which 
the MRP problem was solved had 1004 states and 26898 
transitions. In comparison, our incremental approach solved 
the same problem in 4.44 seconds, thanks to the minimization 
stage of our approach, which reduced the size of the problem 
at every iteration by pruning unneeded actions and states. The 
largest product MDP on which the MRP problem was solved 
in the synthesis stage of our approach had 266 states and 
4474 transitions. The largest product MC that was considered 
in the verification stage of our approach had 405 states 
and 6125 transitions. The probabilities of satisfying $\phi$ under 
policy $\mu_i$ obtained at each iteration of our algorithm were 
$P^{\mu_1}_{\mathcal{M}}(\phi) = 0.463$, $P^{\mu_2}_{\mathcal{M}}(\phi) = 0.566$, $P^{\mu_3}_{\mathcal{M}}(\phi) = 0.627$, 
$P^{\mu_4}_{\mathcal{M}}(\phi) = 0.667$, and $P^{\mu_5}_{\mathcal{M}}(\phi) = 0.8$. When $p_{thr}$ was 
given as 0.65, our approach finished in 3.63 seconds and 
terminated after the fourth iteration returning a sub-optimal 
control policy with a 0.667 probability of satisfying $\phi$. In 
this case, the largest product MDP on which the MRP 
problem was solved had only 99 states and 680 transitions. 
Furthermore, since our algorithm runs in an anytime manner, 
it could be terminated as soon as a control policy was 
available, i.e., at the end of the first iteration (1.25 seconds). 
Fig. 4 compares the classical single-pass approach with our 
incremental algorithm in terms of running time and state 
counts of the product MDPs and MCs. 

Fig. 4: Comparison of the classical single-pass and proposed incremental 
algorithms. The top plot shows the running times of the algorithms and 
the probabilities of satisfying $\phi$ under synthesized policies. The bottom 
plot compares the state counts of the product MDPs on which the MRP 
problem was solved in both approaches (black and red lines) and shows the 
state count of the product MC considered in the verification stage of our 
incremental algorithm (red dashed line). 

It is interesting to note that the state count of the product 
MDP considered in the synthesis stage of our algorithm 
increases as more agents are considered, whereas the state count 
of the product MC considered in the verification stage of 
our algorithm decreases as the minimization stage removes 
unneeded states and transitions after each iteration. It must 
also be noted that $|\mathcal{M}_1|$, i.e., the cardinality of the initial agent 
subset, is an important factor for the performance of our 
algorithm. As discussed in this section, for $|\mathcal{M}_1| \ll |\mathcal{M}|$ 
our algorithm outperforms the classical method both in 
terms of running time and memory usage. However, for 
$|\mathcal{M}_1| \approx |\mathcal{M}|$ we expect the resource usage of our algorithm 
to be close to that of the classical approach, as in this case 
almost all of the agents will be considered in the synthesis 
stage of the first iteration. We plan to address this issue in 
future work. Nevertheless, most typical finite horizon safety 
missions, where the plant is expected to reach a goal while 
avoiding a majority or all of the agents, already satisfy the 
condition $|\mathcal{M}_1| \ll |\mathcal{M}|$. 

VI. Conclusions 

In this paper we presented a highly efficient incremental 
method for automatically synthesizing optimal control poli- 
cies for a system comprising a plant and multiple indepen- 
dent agents, where the plant is expected to satisfy a high 
level mission specification in the presence of the agents. We 
considered independent agents modeled as Markov chains 
and assumed that the plant was modeled as a determin- 
istic transition system. However, our approach is general 
enough to accommodate plants modeled as Markov Deci- 
sion Processes. For mission specifications, we considered 
syntactically co-safe Linear Temporal Logic formulas over 
a set of propositions that are satisfied by the components of 
the system. If a probability threshold is given, our method 
exploits this knowledge to terminate earlier and returns a sub- 
optimal control policy. Otherwise, our method synthesizes 
an optimal control policy that maximizes the probability of 
satisfying the mission. Since our method does not need to 
run to completion, it has practical value in applications where 
a safe control policy must be synthesized under resource 
constraints. For future work, we plan to extend our approach 
to mission specifications expressed in full LTL as opposed 
to a subset of it. 